Column Heterogeneity as a Measure of Data Quality

Authors

  • Bing Tian Dai
  • Nick Koudas
  • Beng Chin Ooi
  • Divesh Srivastava
  • Suresh Venkatasubramanian
Abstract

Data quality is a serious concern in every data management application, and a variety of quality measures have been proposed, including accuracy, freshness and completeness, to capture the common sources of data quality degradation. We identify and focus attention on a novel measure, column heterogeneity, that seeks to quantify the data quality problems that can arise when merging data from different sources. We identify desiderata that a column heterogeneity measure should intuitively satisfy, and discuss a promising direction of research to quantify database column heterogeneity based on a novel combination of cluster entropy and soft clustering. Finally, we present a few preliminary experimental results, using diverse data sets of semantically different types, to demonstrate that this approach appears to provide a robust mechanism for identifying and quantifying database column heterogeneity.
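The abstract names cluster entropy over a soft clustering but does not give the computation. As a rough illustration only, and not the authors' implementation, the sketch below uses one common reading: each column value has a probability of belonging to each cluster, and the entropy is taken over the resulting cluster marginal. The function name `cluster_entropy`, the membership matrices, and the uniform weighting of values are all assumptions made for this example.

```python
import math

def cluster_entropy(memberships, weights=None):
    """Entropy in bits of the cluster marginal p(t) = sum_x p(x) * p(t|x).

    memberships: one row per column value; row[t] is the (assumed given)
    soft-clustering probability that the value belongs to cluster t.
    weights: optional p(x) per value; uniform if omitted (an assumption).
    """
    n = len(memberships)
    if weights is None:
        weights = [1.0 / n] * n
    k = len(memberships[0])
    # Marginal probability of each cluster across the whole column.
    marginal = [
        sum(w * row[t] for w, row in zip(weights, memberships))
        for t in range(k)
    ]
    # Shannon entropy; empty clusters contribute nothing.
    return sum(-p * math.log2(p) for p in marginal if p > 0)

# Hypothetical soft assignments for three columns of four values each.
homogeneous = [[1.0, 0.0]] * 4                      # all values in one cluster
skewed = [[1.0, 0.0]] * 3 + [[0.0, 1.0]]            # one outlier type
balanced = [[1.0, 0.0]] * 2 + [[0.0, 1.0]] * 2      # two types, equal share

for name, column in [("homogeneous", homogeneous),
                     ("skewed", skewed),
                     ("balanced", balanced)]:
    print(name, round(cluster_entropy(column), 3))
```

Under these assumptions, a column whose values all fall in one cluster scores zero, and the score grows as the values spread more evenly over several clusters, which matches the intuition that a column merged from different sources is more heterogeneous.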

Similar resources

A Study of Gas Flow in a Slurry Bubble Column Reactor for the DME Direct Synthesis: Mathematical Modeling from Homogeneity vs. Heterogeneity Point of View

In the present study, a heterogeneous and homogeneous gas flow dispersion model for simulation and optimization of a large-scale catalytic slurry reactor for the direct synthesis of dimethyl ether (DME) from synthesis gas (syngas) and CO2, using a churn-turbulent regime was developed. In the heterogeneous flow model, the gas phase was distributed into two bubble phases including small and large...

Real-time quality monitoring in debutanizer column with regression tree and ANFIS

A debutanizer column is an integral part of any petroleum refinery. Online composition monitoring of debutanizer column outlet streams is highly desirable in order to maximize the production of liquefied petroleum gas. In this article, data-driven models for debutanizer column are developed for real-time composition monitoring. The dataset used has seven process variables as inputs and the outp...

Prevalence of Poor Sleep Quality among Students of Iranian Universities: A Systematic Review and Meta-Analysis

Background: Students' sleep patterns, owing to the stress of studying and teaching workload, differ from those of their non-student peers. The aim of this study was to determine the prevalence of poor sleep quality in college students of Iran by a meta-analysis study, to serve as a final measure for policy makers in this field. Methods: In this meta-analysis study, the databases of PubMed, Science Direct...

Developing a Model of Heterogeneity in Driver’s Behavior

Intelligent Driver Model (IDM) is a well-known microscopic model of traffic flow within the traffic engineering societies. While it is a powerful technique for modeling traffic flows, the Intelligent Driver Model lacks the potential of accommodating the notion of drivers’ heterogeneous behavior whenever they are on roads. Concerning the above mentioned, this paper takes the lane to recognize th...

Dimensionality analysis of subsurface structures in magnetotellurics using different methods (a case study: oil field in Southwest of Iran)

The magnetotelluric (MT) method is an electromagnetic technique that uses the earth's natural field to map the electrical resistivity changes in subsurface structures. Because of the high penetration depth of the electromagnetic fields in this method (tens of meters to tens of kilometers), MT data are used to investigate the shallow to deep subsurface geoelectrical structures and their dimensions....

Publication date: 2006